Abstract 


Benchmarks can be useful in estimating the performance of a computer system
when it is not possible or practical to test the new system with an actual
workload. In the field of high performance computing, some common
benchmarks are the various versions of Linpack, the various versions of the
Numerical Aerospace Simulation Systems Division of NASA Ames Research
Center (NAS) benchmarks, and the STREAM benchmark, as well as older and
less frequently referenced benchmarks such as the Livermore Loops. There are
also those who recommend estimating performance based solely on the peak
speed of the computer system. Unfortunately, the per-processor levels of
performance measured using these benchmarks can vary by 1 to 2 orders of
magnitude for the same system. Therefore, one has to ask, which benchmark(s)
should we be looking at? This report attempts to answer that question by
comparing the measured performance for a variety of real world codes to the
measured performance of the standard benchmarks when run on systems of
interest to the Department of Defense (DOD) High Performance Computing
Modernization Program.
1.  Introduction 


During the summer of 2000, as part of his student internship at the
ARL-MSRC,* Jelani Clay, under the supervision of Daniel M. Pressel, investigated the
following question: Which, if any, of the industry standard benchmarks
adequately predict the performance of real world codes on systems of interest to
the DOD HPCMP? Several benchmarks have been proposed for this purpose,
including the following:

•  the  theoretical  peak  performance  of  the  system, 

•  the  current  SPEC  benchmarks, 

• one or more of the Linpack family of benchmarks,

• the Livermore Loops,

• the STREAM benchmark, and

•  some  of  the  NAS  family  of  benchmarks. 

We concluded that the SPEC benchmarks were primarily single-processor
benchmarks aimed at workstation class systems and therefore deleted them from
our list. Micro benchmarks that seemed to be aimed at measuring the
performance of a specific feature of the architecture were also deleted. These included
benchmarks for FFTs, matrix multiply, various cache benchmarks, etc. We
also felt that the Livermore Loops were generally considered to be obsolete and
were rarely reported anymore. The final selection included the following benchmarks
and data sets:

•  the  theoretical  peak  performance  of  the  system, 

•  the  Linpack  Benchmark-Parallel  when  the  data  was  available, 
supplemented  with  results  for  the  Linpack  N=1000  benchmark, 

• the STREAM benchmark, and

•  the  NAS  NPB  2  benchmarks  for  the  class  B  data  set  (BT,  CG,  LU,  and  SP), 
supplemented  with  results  for  the  class  A  data  set. 

Following  this,  a  search  of  conference  papers  and  websites  related  to  high 
performance  computing  was  undertaken  with  the  goal  of  finding  published 
performance  results  for  as  wide  a  range  of  programs  as  possible.  Unfortunately, 
this  required  us  to  be  able  to  determine  as  precisely  as  possible  the  following 
three  things: 


* Definitions for boldface text can be found in the Glossary.

(1) What system was being used (e.g., simply knowing that the system was an
SGI Origin 2000 with an R10000 processor or an IBM SP with a P2SC
processor was not sufficient if we did not know the processor speed)?

(2)  How  many  processors  were  used? 

(3)  What  was  the  performance  in  MFLOPS  per  processor  or  some  other  unit 
that  could  readily  be  converted  to  this  unit? 

The problem was that many otherwise excellent papers were missing one or more of
these numbers. In rare instances, sufficient information existed from other
sources that we were able to fill in the blanks. However, in an unfortunately
large number of cases, we had to abandon the search for the missing numbers and
proceed with our research.

After  analyzing  all  of  the  data  that  was  collected,  we  arrived  at  the  following 
conclusions: 

(1)  The  peak  speed  of  the  system  is  a  particularly  bad  predictor  of  system 
performance. 

(2)  The  Linpack  benchmarks  closely  track  the  peak  system  speed  and  therefore 
suffer  from  the  same  failing. 

(3) The STREAM benchmark is primarily a serial benchmark and says very
little about the scalability of the system. It also tends to underpredict the
performance of single-processor runs.

(4) The NAS benchmarks support several data sets (classes A: small, B:
medium, C: large, and W: "workstation") and come in four main flavors
(NPB 1: pencil and paper, NPB 2: MPI, and experimental versions based
on HPF and OpenMP). The NPB 2 results produce a range of performance
numbers that seem to correspond closely with the performance results
seen by many real world codes.


2.  Methodology 


The ideal methodology is to first determine which systems are located at the major
sites of interest to the target audience (e.g., for the Users Group for the DOD
HPCMP, the systems located at the MSRCs and the larger DCs). Next, one must try
to determine which benchmarks are the most relevant to the problem domain in
question. In the case of this report, the problem domain is HPC applications,
particularly those applications that are routinely run using at least 100 processors
for a single job. As such, we investigated a large number of commonly referenced
benchmarks and found the following:

• The TPC benchmarks are heavily oriented towards database and not HPC
applications and are therefore not relevant to this study.

• The SPEC benchmarks are relatively small serial benchmarks aimed at the
desktop/deskside market and, again, lack relevance.

• Benchmarks such as Dhrystone and Whetstone are obsolete and rarely
mentioned anymore. Furthermore, they were designed to measure the total
instruction execution rate, not just the floating point execution rate, on
single-processor departmental servers of the 1980s.

• Benchmarks such as the four "FLOPS" benchmarks maintained by Alfred
Aburto of the Naval Ocean Systems Center, San Diego, CA, are slightly
better in that they only deal with floating point operations. However, they
still fail to address the need for a parallel benchmark for HPC applications.

•  Similarly,  we  felt  that  benchmarks  based  on  narrowly  defined 
computational  kernels  (e.g.,  matrix  multiply  or  FFTs)  were  too  narrow  in 
scope  to  be  used  to  benchmark  an  entire  machine. 

•  Micro  benchmarks  (e.g.,  those  designed  to  investigate  the  caches)  can  be 
quite  useful,  but  not  for  this  study. 

•  Livermore  Loops  looked  more  promising,  but  they  were  found  to  be  dated 
and  rarely  referenced  in  recent  literature. 

Therefore,  we  settled  on  the  following  set  of  benchmarks: 

•  the  theoretical  peak  performance  of  the  system, 

•  the  Linpack  Benchmark-Parallel  when  the  data  was  available, 
supplemented  with  results  for  the  Linpack  N=1000  benchmark, 

• the STREAM benchmark, and

•  the  NAS  NPB  2  benchmarks  for  the  class  B  data  set  (BT,  CG,  LU,  and  SP), 
supplemented  with  results  for  the  class  A  data  set. 

We then proceeded to collect the necessary data. Where data are missing, one
might consider personally performing the runs. We chose not to take this
approach and instead attempted to estimate the missing data points using
the following approaches:

• When Linpack-Parallel results were not readily available, we attempted to
use Linpack N=1000 results. If neither was available, but results from a
similar system from the same vendor (e.g., the IBM P2SC 120 MHz is similar to
the IBM P2SC 135 MHz) were available, then the results from the similar
system were used, with the performance scaled based on the clock rates.

• When NAS NPB 2 results for the class B data set were not available, results
for the class A data set were used.

• Once the NPB 2 data set was selected, if results for a run using the correct
number of processors could not be found, then results for the closest
number of processors reported were used. In some cases, this was 1. This
could have presented a serious problem when comparing this
result to runs involving 100 or more processors. Fortunately, in the
case of the SUN HPC 10000, we were able to substitute results for the
OpenMP version of this benchmark. Hopefully, this will make for more
realistic comparisons.

•  Again,  it  was  sometimes  necessary  to  extrapolate  results  from  measured 
systems  to  similar  systems  where  the  data  was  missing.  The  most 
questionable  use  of  this  approach  involved  the  four  IBM  SP  systems  with 
Power  3  processors.  Fortunately,  as  these  systems  have  matured,  additional 
benchmark  results  have  become  available. 

•  For  the  STREAM  benchmark,  it  was  generally  possible  to  obtain  single 
processor  runs.  When  this  was  not  the  case,  and  keeping  in  mind  that  this 
benchmark  was  designed  to  primarily  measure  the  performance  of  the 
memory  system  and  not  the  processor,  we  used  results  for  a  similar  system 
without any scaling. Even so, in the case of the IBM SP with Power 3
processors,  this  may  not  have  been  very  accurate  due  to  the  significant 
differences  in  architecture  of  the  memory  systems  for  the  different  types  of 
nodes.  Another  issue  was  that  for  any  SMP  or  system  with  SMP  nodes, 
running  a  job  on  a  single  processor  with  the  other  processors  in  the 
system/node  idle  would  overstate  the  available  memory  bandwidth  on  a 
per-processor  basis  and  therefore  skew  the  results  to  some  extent. 
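
The clock-rate scaling described in the first bullet above can be sketched as follows. This is only a minimal illustration under a linear-scaling assumption, not the authors' procedure expressed as code: the function name `scale_by_clock` and the 200 MFLOPS input are hypothetical, and only the 120 MHz and 135 MHz clock rates come from the text.

```python
def scale_by_clock(measured_mflops, measured_mhz, target_mhz):
    """Estimate per-processor MFLOPS on a similar system.

    Assumes performance scales linearly with clock rate, which is the
    simplification used when only a sibling system was measured.
    """
    if measured_mhz <= 0 or target_mhz <= 0:
        raise ValueError("clock rates must be positive")
    return measured_mflops * target_mhz / measured_mhz


# Hypothetical input: 200 MFLOPS/processor measured on a 120 MHz system.
# Only the 120 MHz and 135 MHz clock rates come from the report.
estimate = scale_by_clock(200.0, 120.0, 135.0)
print(f"Estimated rate at 135 MHz: {estimate:.1f} MFLOPS/processor")
```

Note that, per the last bullet above, STREAM results borrowed from a similar system were used without any such scaling, since that benchmark primarily measures the memory system rather than the processor.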

Once we had the benchmark numbers, those that were not already in
MFLOPS/processor terms were converted to that format. For the NAS
benchmarks, we attempted to collect the results for two ranges of processor
counts: 100-200 processors and more than 200 processors. Some systems either
did not go that large or had not been benchmarked for the larger configurations.
In those cases, we had to extrapolate the data as previously mentioned.
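
As a rough sketch of the conversion to MFLOPS/processor, the snippet below divides an aggregate rate by the processor count. The function name, the unit table, and the sample figures are assumptions made for illustration; the report does not say which source units were actually encountered.

```python
# Conversion factors to MFLOPS. GFLOPS is included only as an illustrative
# example of another unit that can readily be converted.
TO_MFLOPS = {"MFLOPS": 1.0, "GFLOPS": 1000.0}


def mflops_per_processor(aggregate_rate, unit, processors):
    """Convert an aggregate floating point rate to MFLOPS per processor."""
    if processors < 1:
        raise ValueError("processor count must be at least 1")
    if unit not in TO_MFLOPS:
        raise ValueError(f"unsupported unit: {unit}")
    return aggregate_rate * TO_MFLOPS[unit] / processors


# Hypothetical published result: 12.8 GFLOPS in aggregate on 128 processors.
rate = mflops_per_processor(12.8, "GFLOPS", 128)
print(f"{rate:.1f} MFLOPS/processor")
```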

The results for the real world codes were collected from a variety of sources,
including conference proceedings and runs done by employees of ARL. These
numbers were then placed into three groups, depending on the processor
counts: 1-99 processors, 100-200 processors, and more than 200 processors.
Again, the results were expressed in terms of MFLOPS/processor. No attempt
was made to extrapolate results to systems/system configurations where data
was missing. In many cases, it was clear that the researchers had not continued
to higher processor counts either because they had run out of processors and/or
because their jobs were no longer scaling well. In either case, extrapolating the
results did not seem to be worthwhile.
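
The three-way grouping by processor count can be sketched as below. The group boundaries are the ones stated above; the function name and the sample records are hypothetical and are not taken from the report's tables.

```python
from collections import defaultdict


def processor_group(processors):
    """Return the processor-count group used for real world code results."""
    if processors < 1:
        raise ValueError("processor count must be at least 1")
    if processors < 100:
        return "1-99"
    if processors <= 200:
        return "100-200"
    return ">200"


# Hypothetical records: (program label, processors used, MFLOPS/processor).
runs = [("code_a", 64, 80.0), ("code_b", 128, 55.0), ("code_c", 256, 40.0)]

grouped = defaultdict(list)
for name, processors, rate in runs:
    grouped[processor_group(processors)].append((name, rate))

for group in ("1-99", "100-200", ">200"):
    print(group, grouped[group])
```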
3.  Observations  and  Results 


Figures 1 and 2 and Table 1 compare the benchmark data with the peak speed of
the processors. The Linpack results closely track the peak system speed, although
they have the added benefit of tracking the scalability of the system for certain
classes of codes. Even so, they tend to overpredict the performance in a similar
fashion to using the peak speed. In general, the NAS and STREAM benchmark
results were significantly slower than the Linpack benchmark results.*

When  comparing  the  NAS  and  STREAM  benchmark  results,  it  was  not  clear  how 
much  of  a  difference  there  was  between  the  results  for  these  two  sets  of 
benchmarks.  Therefore,  we  constructed  Figure  3  and  Table  2  to  compare  the 
single  processor  performance  of  the  NAS  benchmarks  to  the  results  for  the 
STREAM  benchmarks.  One  complication  in  compiling  this  data  is  that  due  to 
memory  constraints,  most  vendors  did  not  report  single  processor  runs  for  the 
NAS  benchmarks.  Therefore,  we  had  to  use  the  runs  done  with  the  smallest 
number  of  processors,  in  the  1-16  processor  range.  From  this,  the  following  two 
things  became  clear: 

(1)  The  single  processor  performance  for  the  NAS  benchmarks  was,  in  general, 
significantly  greater  than  what  the  STREAM  benchmark  was  predicting. 

(2)  By  comparing  the  data  from  Table  1  (Figures  1  and  2)  with  the  data  from 
Table  2  (Figure  3)  for  the  NAS  benchmarks,  one  can  clearly  see  the 
importance  of  taking  the  system  interconnect  into  consideration.  One 
problem  with  this  was  that  each  code  would  interact  with  the  system 
interconnect  in  its  own  way,  making  it  difficult  to  offer  sweeping 
generalizations.  For  this  reason,  we  decided  not  to  pursue  the  STREAM 
benchmark  further.  Additionally,  the  importance  of  separating  out  the 
benchmark  runs  and  real  world  runs  into  groups  based  on  the  number  of 
processors being used became all too clear.†


* The NAS benchmarks support several data sets (classes A: small, B: medium, C: large, and
W: "workstation") and come in four main flavors (NPB 1: pencil and paper, NPB 2: MPI, and
experimental versions based on HPF and OpenMP). We found that the NPB 1 results were usually
significantly faster than the NPB 2 results and probably should be considered to be overly
optimistic for most real world codes. Results for HPF and OpenMP were not generally available for
most systems and therefore were not analyzed. The NPB 2 results produce a range of performance
numbers that seem to correspond closely with the performance results seen by many real world
codes. The main drawback to using the NPB 2 results is the difficulty of obtaining numbers for new
systems, since the NAS group at NASA Ames has not recently posted new results to their website.

† If one compares the relative values for the NAS CG and the STREAM benchmark
results, one will see that the CG benchmark performs much better when using only a few
processors (on a per-processor basis), while the STREAM benchmark is virtually unaffected by the
number of processors used. Therefore, when looking for a reasonable lower bound on the
performance of parallel jobs, the NAS CG benchmark looks like it will be a better choice.
Figures 4-6 and Table 3 contain our results from mining the web and a variety of
conference proceedings for results involving real world codes. One can easily see
that for many of the systems a wide range of performance was reported (e.g., one
order of magnitude). To simplify the comparison, the benchmark results and the
results for real world codes were expressed in terms of ranges of performance,
with these numbers appearing in Figures 7-9 and Table 4. This allowed us to
clearly see that in many cases, the Linpack results significantly overstated the
performance that one was likely to achieve with real world codes on modern
HPC systems. Even so, a small number of extremely well-tuned codes exhibited
levels of performance that were comparable to those reported for the Linpack
benchmark. In most cases, the results for the NAS benchmarks as a group were a
better predictor. Unfortunately, without more specific knowledge of the
algorithms involved in the real world codes, it was difficult to be more precise as
to what level of performance any single code would exhibit. Even then, the
results clearly indicated that differences between two data sets of fixed size could
affect the scalability and performance of the same code on the same system.
There was also the additional complication of how much time, effort, and skill
the author of a real world code could contribute when writing or porting a
program.
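
The range-based comparison can be illustrated with a short sketch. It reduces the four NAS NPB 2 results and a set of real world results to (minimum, maximum) pairs and checks whether the two ranges overlap. Every number below is hypothetical, as are the helper names `value_range` and `ranges_overlap`; this is one plausible way to express the comparison, not a reproduction of how the figures were built.

```python
def value_range(values):
    """Return (minimum, maximum) for a non-empty collection of rates."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    return min(values), max(values)


def ranges_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# All numbers below are hypothetical MFLOPS/processor values for one system
# and one processor-count group; they are not taken from the report's tables.
nas_npb2 = {"BT": 60.0, "CG": 20.0, "LU": 70.0, "SP": 45.0}
real_world = [25.0, 40.0, 65.0]
linpack = 400.0

nas_range = value_range(nas_npb2.values())
real_range = value_range(real_world)

print("NAS range:", nas_range)
print("Real world range:", real_range)
print("Ranges overlap:", ranges_overlap(nas_range, real_range))
print("Linpack / best real world code:", linpack / real_range[1])
```

A ratio well above 1 in the last line would correspond to the situation described above, in which Linpack overstates what real world codes are likely to achieve.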


4.  Conclusions 


When  looking  at  the  NAS  NPB  2  benchmarks  (BT,  CG,  LU,  and  SP)  as  a  group, 
their  range  of  performance  on  a  particular  system  of  a  particular  size  range 
seems  to  be  a  good  predictor  of  performance  by  well-tuned  real  world  codes  on 
the  same  system.  In  most  cases,  this  metric  will  be  a  better  choice  than  using 
either  the  STREAM  or  the  Linpack  benchmarks.  We  believe  that  the  class  B  data 
set  for  the  NPB  2  benchmarks  is,  in  general,  the  best  choice;  although  for  smaller 
system  sizes,  class  A  may  also  be  appropriate.  Similarly,  for  larger  system  sizes, 
the  rarely  reported  class  C  data  set  may  be  a  better  choice. 

There  were  two  major  problems  in  carrying  out  this  study: 

(1) People have stopped reporting the NAS benchmarks and, in some cases, the
STREAM and/or Linpack benchmarks for new systems. We recommend
that efforts be made to measure and publicly disseminate the performance
numbers for these benchmarks for as wide a range of systems/system
configurations as is practical.

(2)  Even  when  the  author  of  a  paper  is  primarily  interested  in  the  science 
aspect  and  not  the  performance  when  measured  in  MFLOPS,  it  would  still 
be  helpful  to  have  such  numbers  reported. 
It is also worth noting that this study has some important limitations.
Topping the list is the question of input/output. We feel that input/output is a
sufficiently complicated issue that it is best left to another study. The same holds
true for issues such as usability and system stability. The results for the MIMD
version of the F3D code demonstrate that if one attempts to implement a very
fine-grained level of parallelism using MPI and an MPP with a moderate-to-large
message latency, the performance will suffer to the point that none of the
benchmarks will accurately predict the level of performance. It is best if one can
avoid fine-grained levels of parallelism whenever possible. When that is not
possible, the use of OpenMP on a shared memory platform or a low-latency
message-passing library such as SHMEM on an MPP with a relatively low
message latency is a better choice.
Figure 1. Comparison of commonly used HPC benchmarks (100-200 processors).
[Legend: STREAM, squares; NAS CG, gradients.]

Figure 3. Comparison of commonly used HPC benchmarks (1-16 processors).
[Horizontal axis: system type.]

Figure 6. Performance results for a wide range of real world codes (>200 processors).

Figure 7. Comparison of commonly used HPC benchmarks to real world codes (<100
processors).

Figure 8. Comparison of commonly used HPC benchmarks to real world codes (100-200
processors).

Figure 9. Comparison of commonly used HPC benchmarks to real world codes (>200
processors).
[Legend: peak, diamonds; Linpack Parallel, deltas; gradients show the minimum and maximum
values for the NAS BT, CG, LU, and SP benchmarks; circles show the minimum and maximum
values reported for real world codes. Horizontal axis: system type.]
Table 1. The performance of commonly used systems within the DOD HPCMP on commonly referenced benchmarks.

Columns: System Type; Stream Triad, 1 Processor (MFLOPS), with Reference; Linpack Parallel per Processor; NAS Class B per Processor (MFLOPS) for 100-200 Processors and for >200 Processors, each with Reference; Peak per Processor.

Systems: Compaq SC-667; Cray T3E-900; Cray T3E-1200; HPTi ACL-667; IBM SP P2SC-120; IBM SP P2SC-135; [illegible]; IBM SP P3-HIGH-222; [illegible]; IBM SP P3-THIN-200; IBM SP P3-THIN-375; SGI O2K-195; SGI O2K-250; SGI O2K-300; SGI O3K-400; SUN HPC10000-400.


Table 3. The performance of commonly used systems within the DOD HPCMP as reported for real world codes.

Columns: System Type; Program Name; CTA; Number of Processors Used; Performance per Processor (MFLOPS); Reference.

Jobs using less than 100 processors

Systems: Compaq SC-667; Cray T3E-900; Cray T3E-1200; HPTi ACL-667; IBM SP P2SC-120; IBM SP P2SC-135.

Programs: CCM/MP-2D; MM5; Paratec; Ocean/Wallcraft; NAMD; PCM; CCM3; FE-MIMD; Uncle; PSTSWM; SUBOFF; RIEMANN; F3D-MIMD; CG+Schwarz/Rich.; FUN3D; Overflow.

CTAs: CWO; CCM; CFD.
Table 4. A comparison of benchmark results to reported performance levels for real world codes for commonly used systems within the
DOD HPCMP.

Note: The data for this table is a summary of the data from Tables 1 and 3.